Feature/validator.ValidateMulti() #405

freemans13 wants to merge 16 commits into bsv-blockchain:main from feature/validator-validate-multiple
Conversation
Fixes a chain validation failure when syncing more than 10,000 blocks.

The bug occurred because GetBlockHeadersFromOldest includes the starting block (it uses >= in SQL), causing duplicate headers at iteration boundaries. When a catchup spans multiple iterations (every 10,000 blocks), the last header from iteration N becomes the first header of iteration N+1, creating a duplicate that breaks chain validation.

Changes:
- Skip the first header in iterations 2+ if it matches the last header from the previous iteration
- Handle the edge case where all headers are duplicates (chain tip reached)
- Add debug logging for duplicate detection

Tests:
- Add comprehensive multi-iteration tests covering 10K and 25K header scenarios
- Test edge cases (single duplicate header, exact boundary conditions)
- Verify chain continuity across iteration boundaries
- Validate that the header chain cache builds correctly with multi-iteration headers

All existing tests pass.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
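For illustration, the boundary handling described above can be sketched roughly as follows; the function, type, and parameter names are illustrative assumptions, not the exact identifiers in this PR:

```go
// dedupeBoundary drops a leading header that duplicates the previous batch's
// tail. Types (model.BlockHeader, chainhash.Hash) and names are assumptions.
func dedupeBoundary(headers []*model.BlockHeader, lastHash *chainhash.Hash, iteration int) []*model.BlockHeader {
	// GetBlockHeadersFromOldest includes the starting block (>= in SQL), so
	// from the second iteration onwards the first header of a batch can
	// duplicate the last header of the previous batch.
	if iteration > 0 && len(headers) > 0 && lastHash != nil && headers[0].Hash().IsEqual(lastHash) {
		headers = headers[1:]
	}
	// An empty result means every header was a duplicate: chain tip reached.
	return headers
}
```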
🤖 Claude Code Review

Status: Complete

Summary: This PR introduces significant architectural changes: a custom worker pool for subtree validation and concurrent catchup sessions with sequential validation ordering. The implementation is generally sound, but several areas need attention.

Key issues: see the inline comments for details. The core concurrent catchup session management looks well-designed, with proper session tracking and metrics.
```go
for _, result := range pool.results {
	if result.err != nil {
		// Do not wrap the error again, the error is already wrapped
		return result.err
```
Critical: Early return on first error loses other validation errors
The current implementation returns immediately on the first subtree validation error found. This means if multiple subtrees have validation errors, only the first one discovered (which could be any due to concurrent execution) will be reported.
This makes debugging harder because developers won't see all the validation failures at once. Consider either:
- Collecting all errors and returning a combined error
- Documenting this "fail-fast" behavior explicitly in comments
The previous errgroup implementation had the same behavior, but it's worth considering if this is the desired outcome for block validation.
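For the first option, a minimal sketch (assuming the pool exposes its per-subtree results as a slice with an err field; the type and field names are illustrative, not the actual ones in this PR):

```go
import "errors"

// collectSubtreeErrors gathers every validation failure instead of stopping
// at the first one. subtreeResult and its err field are illustrative names.
func collectSubtreeErrors(results []subtreeResult) error {
	var errs []error
	for _, r := range results {
		if r.err != nil {
			// The workers already wrap their errors, so collect them as-is.
			errs = append(errs, r.err)
		}
	}
	// errors.Join returns nil when errs is empty and keeps each error
	// reachable via errors.Is / errors.As on the combined value.
	return errors.Join(errs...)
}
```

Note that errors.Join requires Go 1.20+; on older toolchains a multierror helper would be needed instead.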
```go
// Close closes the job channel and waits for all workers to finish
func (p *subtreeWorkerPool) Close() {
	close(p.jobs)
	p.wg.Wait()
```
Potential resource leak: Context cancellation without cleanup
The Shutdown() method cancels the context and closes the channel, but Close() just closes the channel without canceling. If a worker is blocked on p.ctx.Done() and the caller uses Close() instead of Shutdown(), the context will never be canceled.
However, I see this is only called in one place (Block.go:700) and it uses Close(). Since the parent context passed to the pool should handle cancellation via the tracing/request lifecycle, this may be acceptable.
Consider either:
- Documenting when to use Close() vs Shutdown()
- Removing Shutdown() if it's never used
- Always canceling the context in Close() for safety
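If the last option is chosen, one possible shape is sketched below; it assumes the pool stores the CancelFunc created together with p.ctx as p.cancel, which is not shown in the quoted code:

```go
// Close drains queued jobs, waits for the workers, and then cancels the
// pool's context so nothing can remain blocked on p.ctx.Done() afterwards.
// p.cancel is assumed to be the CancelFunc paired with p.ctx.
func (p *subtreeWorkerPool) Close() {
	close(p.jobs) // no more jobs will be accepted; workers exit once drained
	p.wg.Wait()   // wait for in-flight jobs to finish
	p.cancel()    // release the derived context and any ctx.Done() waiters
}
```

Cancelling after wg.Wait() keeps the "wait for all workers to finish" semantics of the current Close() while still guaranteeing the context is released.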
```go
// Subtree validation is ~97% I/O (file reads from blob store)
// High concurrency needed to saturate disk I/O throughput
// On 8-core machine: 512 workers
numWorkers := runtime.GOMAXPROCS(0) * 64
```
Extremely high default worker count: 64x CPU cores
This defaults to 512 workers on an 8-core machine. While the comment justifies this for I/O-heavy operations, this is an unusually aggressive multiplier that could cause issues:
- Each worker is a goroutine with its own stack (minimum 2KB, grows to 1GB)
- 512 concurrent blob store reads could overwhelm file descriptors, network connections, or disk I/O queues
- The comment mentions "~97% I/O" but provides no benchmark data to support the 64x multiplier
Consider:
- Adding benchmark results showing that 64x actually improves performance over lower values (8x, 16x, 32x)
- Making this configurable via settings with a more conservative default (e.g., 16x)
- Testing with actual production block sizes to ensure this doesn't cause resource exhaustion
The previous errgroup implementation used getValidationConcurrency() which appears to use a different calculation - was that measured to be insufficient?
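For the second suggestion, the multiplier could be driven by a setting with a conservative fallback, along these lines (the setting and function names are hypothetical, not existing fields in this codebase):

```go
import "runtime"

// subtreeWorkerCount derives the pool size from a configurable multiplier.
// A non-positive value falls back to a conservative 16x until benchmarks
// justify something more aggressive.
func subtreeWorkerCount(multiplier int) int {
	if multiplier <= 0 {
		multiplier = 16
	}
	return runtime.GOMAXPROCS(0) * multiplier
}
```

Usage would then be something like numWorkers := subtreeWorkerCount(settings.SubtreeValidation.WorkerMultiplier), with the field name again being purely illustrative.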
services/blockvalidation/catchup.go (Outdated)
```go
// This blocks until the previous session's validation completes
u.logger.Infof("[catchup][%s][session:%s] waiting for validation semaphore", catchupCtx.blockUpTo.Hash().String(), catchupCtx.sessionID)
select {
case u.validationSemaphore <- struct{}{}:
```
Missing test coverage for validation semaphore blocking behavior
The validation semaphore is critical for ensuring sequential validation across concurrent catchup sessions, but I don't see tests that verify:
- Multiple sessions can download concurrently (as intended)
- Only one session can validate at a time (enforced by semaphore)
- Sessions properly release the semaphore even when validation fails
- Sessions handle context cancellation while waiting for the semaphore
The existing test TestCatchup_ConcurrentCatchupSessions only checks session registration/unregistration, not the actual semaphore blocking behavior. Consider adding integration tests that verify these concurrent behaviors.
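A rough sketch of the kind of test meant here, exercising the buffered-channel semaphore pattern from the quoted code rather than the real catchup service:

```go
import (
	"testing"
	"time"
)

// Verifies that a second "session" cannot acquire the semaphore while the
// first still holds it, and proceeds once it is released.
func TestValidationSemaphore_SequentialValidation(t *testing.T) {
	sem := make(chan struct{}, 1)
	sem <- struct{}{} // session A acquires

	acquired := make(chan struct{})
	go func() {
		sem <- struct{}{} // session B blocks until A releases
		close(acquired)
	}()

	select {
	case <-acquired:
		t.Fatal("second session acquired the semaphore while the first still held it")
	case <-time.After(50 * time.Millisecond):
		// expected: B is still blocked
	}

	<-sem // session A releases

	select {
	case <-acquired:
		// expected: B acquires once the semaphore is free
	case <-time.After(time.Second):
		t.Fatal("second session never acquired the semaphore after release")
	}
}
```

A cancellation test would follow the same shape, with the waiting goroutine selecting on ctx.Done() as the quoted production code does.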
```diff
 # can be dangerous if the instance crashes or the storage can't keep up at all
 # monitor the queue with `asadm -e "show statistics like write_q"`
-max-write-cache 1024M
+max-write-cache 4096M
```
Significant Aerospike tuning changes without justification
This PR changes several critical Aerospike performance parameters:
- post-write-cache: 256 -> 1024 (4x increase)
- defrag-sleep: 2000 -> 10000 (5x slower defragmentation)
- flush-max-ms: 1000 -> 5000 (5x longer flush delay)
- max-write-cache: 1024M -> 4096M (4x larger write buffer)
These changes significantly affect UTXO store write performance and memory usage. While the PR title mentions ValidateMulti(), the Aerospike tuning seems unrelated to that feature.
Questions:
- Are these changes necessary for the worker pool/concurrent catchup features?
- Were these settings benchmarked with production workloads?
- Should these be in a separate PR focused on storage optimization?
- What happens on systems with less available RAM when max-write-cache is 4GB?